Tom Augspurger, one of the maintainers of Python's Pandas library for data analysis, has an awesome series of blog posts on writing idiomatic Pandas code. In fact, you should probably leave this site now and go read one of those posts; they're really good. His post on Performance has an especially interesting tip:
"You rarely want to use DataFrame.apply and almost never should use it with axis=1 [which processes the DataFrame row-by-row, "across columns"]. Better to write functions that take arrays and pass those in directly..."
In Tom's example, he has a function built from numpy math calls, and he shows that it runs dramatically faster when those numpy functions are passed entire columns as arguments, which they can process as vectors. Using .apply(), on the other hand, calls those functions on one number at a time, in a loop.
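To make that concrete, here's a tiny sketch of the kind of comparison Tom describes -- the DataFrame and the distance function below are made-up stand-ins, not his actual example:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': np.random.rand(100000),
                    'b': np.random.rand(100000)})

def distance(x, y):
    # works the same whether x and y are two numbers or two whole columns
    return np.sqrt(x**2 + y**2)

# row-by-row: the numpy calls run once per row, inside a Python-level loop
slow = toy.apply(lambda row: distance(row['a'], row['b']), axis=1)

# vectorized: the same function gets whole columns and operates on arrays directly
fast = distance(toy['a'], toy['b'])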
It certainly makes sense that the vectorized approach --passing whole DataFrame columns to a function which accepts array(s) as input-- should provide a significant speedup for functions built on numpy math calls, since those can operate on arrays/vectors directly. But what about text processing?
Here, I'll take a simple text-processing function I've used with .apply() before, and compare its performance with a slightly modified version meant to accept whole DataFrame columns instead of single strings.
First, let's load our dataset - the Quora Duplicate Questions dataset released earlier this year.
In [2]:
import pandas as pd
df = pd.read_csv('datasets/quora_kaggle.csv')
df.head(3)
Out[2]:
The function I'll be testing is a simple text-processing function for tokenizing a string - returning the string as a list of words, after doing a bit of preprocessing.
In [3]:
import re
from nltk.corpus import stopwords
def tokenize(text):
    ''' Accept a string, return list of words (lowercased) without punctuation or stopwords'''
    # lowercase everything
    text = text.lower()
    # remove punctuation (r"\W" is a regex that matches any non-word character)
    text = re.sub(r"\W", " ", text)
    # return list of words, without stopwords (stopwords are very common words which may not convey much info)
    droplist = stopwords.words('english')
    return [word for word in text.split() if word not in droplist]
tokenize('This is a sentence. And another one with punctuation and special characters to strip!?*&^%')
Out[3]:
It takes about 49 seconds to apply this function to all of our 'question1' questions using .apply():
In [4]:
from datetime import datetime
start = datetime.now()
df['q1_tokenized'] = df['question1'].apply(tokenize)
print('Time elapsed: ', datetime.now() - start, '\n')
print(df[['question1', 'q1_tokenized']].head(3))
Time elapsed:  0:00:49.371585

(That's the middle time of three runs from a shell session; run-to-run times there were more consistent than in the notebook.)
Let's see if we can speed this up by modifying our tokenize function to accept a Pandas Series of strings, instead of a single string. That way we won't have to use .apply().
In [5]:
def tokenize2(text_series):
    ''' Accept a Series of strings, return a Series of word lists (lowercased) without punctuation or stopwords'''
    # lowercase everything
    text_series = text_series.str.lower()
    # remove punctuation (r'\W' matches any non-word character; regex=True treats the pattern as a regex)
    text_series = text_series.str.replace(r'\W', ' ', regex=True)
    # return list of words, without stopwords
    sw = stopwords.words('english')
    return text_series.apply(lambda row: [word for word in row.split() if word not in sw])
And to measure performance of the (mostly) vectorized approach:
In [6]:
start = datetime.now()
df['q1_tokenized'] = tokenize2(df['question1'])
print('Time elapsed: ', datetime.now() - start, '\n')
print(df[['question1', 'q1_tokenized']].head(3))
Time elapsed:  0:00:11.859043

(Again, the middle time of three runs from a shell session.)
Vectorizing our tokenizing function netted a >4X speedup. And just as importantly (to me anyway, in most cases): we didn't have to sacrifice code clarity to get the performance gain.
We got this speedup using just two built-in Series.str functions, even with a Series.apply() at the end of tokenize2 that I couldn't figure out quickly how to vectorize (though I bet there's a way to do it). And the code barely changed; to modify the function to accept a Series of strings instead of a string, I just changed:
text.lower() to text_series.str.lower(), and
re.sub(..., text) to text_series.str.replace(...)

This was a really nifty performance tip, especially considering how intuitive, and frankly idiomatic, it feels to use DataFrame.apply() in so many cases. To quote Tom again: "it's very natural to have to translate an equation to code and think, 'Ok now I need to apply this function to each row', so you reach for DataFrame.apply." But as this example shows, vectorizing your functions to accept a whole Pandas Series at a time and avoid .apply() pays large dividends.
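About that one remaining Series.apply() inside tokenize2: here's a rough sketch of how it might be vectorized too. I haven't benchmarked it against the timings above, and it assumes the Series has a unique index and no missing questions. The idea is to split each question into words, explode to one row per word with Series.explode (available in newer versions of pandas), filter the words with a vectorized isin(), and then group the survivors back into lists by the original index:

from nltk.corpus import stopwords

def tokenize3(text_series):
    ''' Rough sketch: same idea as tokenize2, but without the per-row list comprehension '''
    sw = set(stopwords.words('english'))
    cleaned = text_series.str.lower().str.replace(r'\W', ' ', regex=True)
    # one row per word; the original index value repeats for each word in a question
    words = cleaned.str.split().explode()
    # vectorized stopword filter
    words = words[~words.isin(sw)]
    # collect the surviving words back into one list per original row
    # (rows whose words were all stopwords come back as NaN here, rather than [])
    return words.groupby(level=0).agg(list).reindex(text_series.index)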